Big Mart Sales Prediction Exploration

by Shuyu Wang

Link for the dataset: https://www.kaggle.com/devashish0507/big-mart-sales-prediction/data

This is a dataset created by the data scientists at BigMart.
They have collected 2013 sales data for 1559 products across 10 stores in
different cities. Also, certain attributes of each product and store have
been defined. The aim for them was to build a predictive model and find out
the sales of each product at a particular store.
Here I use this dataset to do a data exploration using R.

Load data to start

Load data and have a look

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility  
##  FDG33  :  10    Min.   : 4.555   LF     : 316     Min.   :0.00000  
##  FDW13  :  10    1st Qu.: 8.774   low fat: 112     1st Qu.:0.02699  
##  DRE49  :   9    Median :12.600   Low Fat:5089     Median :0.05393  
##  DRN47  :   9    Mean   :12.858   reg    : 117     Mean   :0.06613  
##  FDD38  :   9    3rd Qu.:16.850   Regular:2889     3rd Qu.:0.09459  
##  FDF52  :   9    Max.   :21.350                    Max.   :0.32839  
##  (Other):8467    NA's   :1463                                       
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1232   Min.   : 31.29   OUT027 : 935     
##  Snack Foods          :1200   1st Qu.: 93.83   OUT013 : 932     
##  Household            : 910   Median :143.01   OUT035 : 930     
##  Frozen Foods         : 856   Mean   :140.99   OUT046 : 930     
##  Dairy                : 682   3rd Qu.:185.64   OUT049 : 930     
##  Canned               : 649   Max.   :266.89   OUT045 : 929     
##  (Other)              :2994                    (Other):2937     
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985                    :2410   Tier 1:2388         
##  1st Qu.:1987              High  : 932   Tier 2:2785         
##  Median :1999              Medium:2793   Tier 3:3350         
##  Mean   :1998              Small :2388                       
##  3rd Qu.:2004                                                
##  Max.   :2009                                                
##                                                              
##             Outlet_Type   Item_Outlet_Sales 
##  Grocery Store    :1083   Min.   :   33.29  
##  Supermarket Type1:5577   1st Qu.:  834.25  
##  Supermarket Type2: 928   Median : 1794.33  
##  Supermarket Type3: 935   Mean   : 2181.29  
##                           3rd Qu.: 3101.30  
##                           Max.   :13086.97  
## 
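The code chunk that produced this summary is not shown in the rendered report. A minimal sketch of the loading step, assuming the Kaggle download is a CSV file: the file name `Train.csv`, the object name `bigmart`, and the four inline rows are all hypothetical and only keep the example self-contained.

```r
# Hypothetical stand-in for the real file; with the Kaggle download this
# would be something like read.csv("Train.csv") (file name is an assumption).
csv_text <- "Item_Identifier,Item_Weight,Item_Fat_Content,Item_MRP,Outlet_Type,Item_Outlet_Sales
A01,9.3,Low Fat,249.8,Supermarket Type1,3735.1
A02,5.9,Regular,48.3,Supermarket Type2,443.4
A03,,LF,141.6,Grocery Store,732.4
A04,19.2,reg,182.1,Supermarket Type1,2097.3"

# stringsAsFactors = TRUE makes summary() count the levels of text columns,
# which is the form of the output printed above.
bigmart <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(bigmart)      # column types
summary(bigmart)  # per-column summaries
```

Empty numeric fields (row `A03`) are read as `NA`, which is how the missing `Item_Weight` values show up in the summary.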

Check Missing Values

## 
##  FALSE   TRUE 
## 100813   1463
##           Item_Identifier               Item_Weight 
##                         0                      1463 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                         0                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                         0
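The two outputs above have the shape produced by `table(is.na(...))` and `colSums(is.na(...))`, although the original chunk is not shown. The sketch below reproduces that check on a made-up data frame (`toy` and its values are assumptions). It also adds a check for empty strings, because the summary shows 2,410 rows with a blank Outlet_Size that `is.na()` does not count.

```r
# Toy data frame (made-up values) with the same kinds of gaps as the real data.
toy <- data.frame(
  Item_Weight = c(9.3, NA, 17.5, NA),
  Outlet_Size = c("Medium", "", "Small", ""),
  Item_MRP    = c(249.8, 48.3, 141.6, 182.1)
)

table(is.na(toy))    # overall count of non-missing (FALSE) vs missing (TRUE) cells
colSums(is.na(toy))  # missing cells per column

# Empty strings are not NA, so they need a separate check.
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))
blank_counts
```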

Data Visualization

Univariate Plots Section

From this plot we can see that item weights are distributed fairly evenly across their range.

This distribution is right-skewed: most values are piled up at the low end, with a long tail toward higher values.

After cleaning the ‘LF’, ‘low fat’, and ‘reg’ labels, the data looks much better.
From this bar plot we can see that Low Fat products make up a larger share of the items than Regular fat products.

We can see that Fruits and Vegetables and Snack Foods are the two most
common categories. Maybe people feel that eating lots of healthy food like
fruits and vegetables gives them permission to eat some snacks.

Supermarket Type1 is the main form of BigMart outlet: over 5,000 records come from Type 1 supermarkets.
Each of the other outlet types accounts for roughly 1,000 records.

From this plot we can get a sense of how prices are distributed. There are
several distinct price ranges in this distribution: below 100, around 100,
below 200, and above 200. Most products fall in the around-100 and
below-200 ranges.

Univariate Analysis

What is the structure of your dataset?

There are 8,523 records in the dataset (the 102,276 cells counted in the
missing-value check divided by 12 columns) with
12 features (Item_Identifier, Item_Weight, Item_Fat_Content,
Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier,
Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type,
Outlet_Type and Item_Outlet_Sales).

Item_Identifier and Outlet_Identifier are id columns.
We are not getting too much information out of them.

Other observations:

  • The median and mean value of Item_Weight are very close
  • There are almost twice as many Low Fat items as Regular items
  • Fruits and Vegetables and Snack Foods lead the Item_Type column

What is/are the main feature(s) of interest in your dataset?

Outlet_Size, Outlet_Location_Type and Item_Outlet_Sales.
I think that the location of the outlet will influence
sales a lot, since the location determines the demographics of
the people who go to that particular outlet. People from the
same area are likely to have similar purchasing habits.

What other features in the dataset do you think will help
support your investigation into your feature(s) of interest?

Item_Type and Item_MRP might help, since they are likely to have a sizable impact on sales.

Did you create any new variables from existing variables in the dataset?

No

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I cleaned the Item_Fat_Content column: I changed ‘LF’ and ‘low fat’ to ‘Low Fat’
and ‘reg’ to ‘Regular’, because these labels are just different spellings of the
same two categories.
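The recoding step can be sketched as follows. The label counts are the ones reported by the summary at the top of the report; the variable names `fat` and `fat_clean` are assumptions, and the author's actual code may have used a different approach.

```r
# Rebuild the raw label counts reported by summary() earlier in the report.
fat <- factor(rep(c("LF", "low fat", "Low Fat", "reg", "Regular"),
                  times = c(316, 112, 5089, 117, 2889)))

# Map every spelling onto one of two canonical labels.
fat_clean <- as.character(fat)
fat_clean[fat_clean %in% c("LF", "low fat")] <- "Low Fat"
fat_clean[fat_clean == "reg"] <- "Regular"
fat_clean <- factor(fat_clean)

table(fat_clean)
# Low Fat: 316 + 112 + 5089 = 5517; Regular: 117 + 2889 = 3006
```

After merging, the ratio of Low Fat to Regular records is 5517 / 3006, or about 1.84.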

Bivariate Plots Section

Item Weight vs. Sale

There is no specific relationship between item weight and sales.

Item Fat Content and Item Type

We can see that Regular fat only wins in the Meat category.
People choose Low Fat more often in almost every other category.

Item Fat Content vs. Sale

There is no specific relationship between item fat content and sales.

Item_Visibility vs Sale

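A very large share of sales comes from products whose visibility is less
than 0.2. This is partly expected, since three quarters of the records have
a visibility below 0.1 (the third quartile is 0.09459).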

Correlation between numerical features

From this correlation plot we can see that Item_MRP has a fairly
strong correlation with Item_Outlet_Sales.
The remaining numerical columns are not heavily correlated.
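A correlation matrix of this kind can be sketched in base R as below. The data is synthetic and generated only for illustration, so the printed numbers are not BigMart results; the column ranges are borrowed from the summary, and the dependence of sales on price is built in by assumption.

```r
set.seed(1)  # synthetic data only, so the numbers are not BigMart results
n <- 200
toy <- data.frame(
  Item_Weight     = runif(n, 4.5, 21.5),
  Item_Visibility = runif(n, 0, 0.33),
  Item_MRP        = runif(n, 31, 267)
)
# Sales are constructed to depend on price plus noise (an assumption for the demo).
toy$Item_Outlet_Sales <- 15 * toy$Item_MRP + rnorm(n, sd = 800)
toy$Item_Weight[sample(n, 30)] <- NA  # mimic the missing weights

# pairwise.complete.obs uses, for each pair of columns, the rows where both are present.
corr <- cor(toy, use = "pairwise.complete.obs")
round(corr, 2)

# Scatterplot matrix as a quick visual companion to the numbers.
pairs(toy)
```

Without the `use` argument, `cor()` returns `NA` for every pair involving Item_Weight, because that column contains missing values.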

Item_Type vs Sale

Fruits and Vegetables contribute the highest amount of outlet sales.
This makes sense considering the previous count plot, which showed that
Fruits and Vegetables also rank first by number of records.

Outlet Identifier vs. Sales

Outlet OUT027 contributed the largest amount of sales.
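A group with the largest total may simply have the most records, so it can help to look at totals and per-record means side by side. The sketch below does this with `tapply()` on made-up rows; the outlet IDs follow the dataset's naming, but the sales figures are invented for illustration.

```r
# Made-up rows; outlet IDs follow the dataset's naming but the sales do not.
toy <- data.frame(
  Outlet_Identifier = c("OUT027", "OUT027", "OUT013", "OUT013", "OUT010"),
  Item_Outlet_Sales = c(4000, 3500, 2100, 1900, 300)
)

# Total sales per outlet, largest first.
totals <- sort(tapply(toy$Item_Outlet_Sales, toy$Outlet_Identifier, sum),
               decreasing = TRUE)
totals

# Mean sales per record, to separate "sells more per item" from "has more records".
means <- tapply(toy$Item_Outlet_Sales, toy$Outlet_Identifier, mean)
means
```

The same pattern applies to Item_Type by swapping the grouping column.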

Multivariate Plots Section

Relationship between Item_MRP and Sales at different outlet type

We can see that Item_MRP and Item_Outlet_Sales have a clear
relationship in the three supermarket types, but the relationship
is not very clear in grocery stores.
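The visual impression can be backed with a per-group number. The sketch below computes the correlation and the fitted slope of sales on price within each outlet type. The data is synthetic, and the per-type slopes are assumptions chosen for the demo, not estimates from the BigMart data.

```r
set.seed(2)  # synthetic data; the effect sizes are assumptions, not estimates
n <- 150
toy <- data.frame(
  Outlet_Type = rep(c("Grocery Store", "Supermarket Type1", "Supermarket Type3"),
                    each = n),
  Item_MRP    = runif(3 * n, 31, 267)
)
# Give each outlet type its own price-to-sales slope for the demo.
slope <- c("Grocery Store" = 2, "Supermarket Type1" = 15, "Supermarket Type3" = 25)
toy$Item_Outlet_Sales <- unname(slope[as.character(toy$Outlet_Type)]) * toy$Item_MRP +
  rnorm(3 * n, sd = 600)

# Correlation and fitted slope of sales on price within each outlet type.
by_type <- sapply(split(toy, toy$Outlet_Type), function(d) {
  c(cor   = cor(d$Item_MRP, d$Item_Outlet_Sales),
    slope = unname(coef(lm(Item_Outlet_Sales ~ Item_MRP, data = d))[2]))
})
round(by_type, 2)
```

A panel that looks flat on a shared y-axis can still hide a small positive slope, so checking the numbers per group is a useful complement to the plot.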

Relationship between Item_Fat_Content and Sales at different outlet types and location tiers

The most obvious observation is that only Tier 3 has all of the outlet types.
Both Low Fat and Regular sales in Supermarket Type3 outlets and in Tier 3 locations are higher than elsewhere.

We see that in grocery stores, item visibility barely affects sales at all,
whereas in the supermarket types, sales seem to go down as visibility goes up.
But this trend is not very strong in any of the four outlet types,
so it may be safe to say that visibility does not have a strong relationship
with sales.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Visibility is not a very strong indicator of sales.
This tells us that products with high visibility are not going to sell
much better than those with lower visibility. It suggests that people know what
they want when they go to this kind of place. I can imagine that when people know
they need some paper towels, they are not going to buy chocolate even if it has
very high visibility. They just go straight to their target.
By contrast, I can imagine that when people go to shopping malls, they don’t
really know what they want and just look around. In that case, a product with
high visibility has a higher probability of drawing their attention.

Were there any interesting or surprising interactions between features?

Yes. Low Fat products are not that popular across different situations,
such as location tiers and outlet types, so people seem to like Regular fat products.
Only Supermarket Type1 has stores in all three tiers,
so Supermarket Type1 may be the most popular of all the outlet types.

Final Plots and Summary

Plot One

This plot shows that Item_Outlet_Sales has a similar
distribution for Low Fat and Regular products. However, the counts
are very different. The count of Low Fat products peaks at about 500,
whereas the count of Regular products peaks at around 300.
The obvious difference on the surface is the number of products,
but within each fat content the pattern is similar.
This suggests that the structure of how people consume is similar for each fat content.

Plot Two

This plot shows that, among the numerical features, Item_MRP is the one most
correlated with Item_Outlet_Sales.
The information in this plot is well aligned and very clear.
The numbers give a precise sense of how strongly the features are correlated,
and the plots give a general sense of how they are related and of each feature's own distribution.

Plot Three

This plot shows the correlation between Item_MRP and Item_Outlet_Sales.
We can see that there’s an obvious positive correlation between
Item_MRP and Item_Outlet_Sales in the three supermarket types,
but no obvious relationship between the two in grocery stores.
So one big thing to keep in mind is that a strong correlation
overall does not mean a strong correlation in every category.

Reflection

This data set is pretty good. One thing I did was
clean the Item_Fat_Content column, renaming ‘LF’ and ‘low fat’ to ‘Low Fat’
and ‘reg’ to ‘Regular’, which made the column much more usable.
We can see that Item_Fat_Content has an obvious impact on sales from various perspectives.
Low Fat products have better sales than Regular fat products.

One of my struggles came during the multivariate analysis phase.
I tried many possible ways of combining variables and plotting them to look for insights.
However, many of the combinations did not provide very good insights.
Analysis of some of the numerical columns went fairly well, since I could make many plots with them,
such as histograms, bar plots, boxplots and so on.

I thought location was going to be an important factor for sales,
but it turned out that location does not have that much of an impact on sales.
Maybe that’s because people have convenient transportation to get to
the outlet if they really need to purchase something.

What I think is generally useful for many data analysis use cases is that
plotting the correlation matrix is very important for getting to know the data.
The correlation matrix of the numerical data tells the story between features,
so that you can get an idea of how the features influence each other and how they
are going to affect the models.